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COMPARISON OF SOME BIASED ESTIMATION METHODS (INCLUDING 
ORDINARY SUBSET REGRESSION) IN THE LINEAR MODEL 
by Steven M. Sidik 
Lewis Research Center 

SUMMARY 

Three major types of biased estimator have recently been proposed in the litera- 
ture: ridge, Marquardt's generalized inverse, and shrunken. Besides these newer 
biased estimators, we shall consider principal components regression and subset 
regression, which are also, in effect, biased estimation methods. We present the bi- 
ased and unbiased estimators of the parameters in a linear model. The presentation 

T 

centers on a duality of the X X matrix of the least-squares normal equations. 

We consider biased estimators with respect to all three major objectives of a linear 
model analysis: 

1. Estimation of parameters: (a) In a nearly singular system, the full parameter 
vector is essentially inestimable. However, certain linear combinations of the param- 
eters are estimable, (b) Biased estimators all place some kind of constraint on the 
parameter space in order to achieve "better" estimators, (c) The decision to use mean 
squared error as a criterion of goodness should be made independently of the existence 
of multicollinearity. (d) If mean squared error is to be accepted as a criterion of good- 
ness, only one estimator so far proposed has any proven optimality properties, (e) Be- 
cause distributional information is lacking, no biased estimator provides interval esti- 
mation capability. 

2. Estimation of predictive regression function: All biased estimators discussed 
offer the possibility of decreased mean squared error of the predictive regression func- 
tion. This decrease cannot be assured (except for two special cases of principal com- 
ponents estimators) because it is not known how to identify the members of each class of 
biased estimators that provide smaller mean squared error. 

3. Hypothesis testing of parameters: Only the ordinary least- squares estimators 
have enough of the distributional theory available to provide subset regression techniques 
in the original parameterization. 

The overall conclusion is that ordinary least- squares estimation and subset regression 
methods are still the preferred methods of linear model analysis in the regression situ- 
ation . 



INTRODUCTION 


Suppose as a result of experiment or observation you have accumulated a vector 

y n X l of an °b servec * variable which you believe may be expressed as a function of a 

matrix X of p other observed predictor variables . The standard linear model 
nxp 

assumes 


y = Xb + e 

where b pXl is an unknown parameter vector and where e nX ^ is an unobservable vector 
of errors of observation. Some of the objectives of an analysis of these data are 

(1) Obtain an estimate of b. You may be interested in either point or interval es- 
timates . 

(2) Predict values of y at some combination of the predictor variables. 

(3) Test if certain of the components of b may reasonably be set to zero. 

The use of multiple linear regression analysis in achieving these goals has been 
intensively studied by many authors. The most admirable single summary of theory and 
practice is Draper and Smith (ref. 1). 

Recently, some attention has been given to two aspects of regression analysis. The 
first aspect is the attempted improvement of point estimators where the criterion of 
goodness is mean squared error (Stein (ref. 2), James and Stein (ref. 3), and Scloye 
(ref. 4)). Sclove discusses an estimation technique which guarantees that the sum of 
component- wise mean squared errors of the biased estimator is smaller than that of the 
ordinary unbiased least-squares estimator. He also presents some further results 
under the restrictive condition that the terms of the equation can be ordered in impor- 
tance prior to analysis. These procedures can be somewhat difficult to implement and 
very little is known about the distributional properties of the resulting estimators. 

The second aspect of regression that has been considered recently is the problem of 
point estimation when there is a high degree of multicollinearity among the predictor 
variables (Hoerl and Kennard (refs. 5 and 6), Marquardt (ref. 7), Mayer and Willke 
(ref. 8), Kendall (ref. 9), and Massy (ref. 10)). 

Hoerl and Kennard (ref. 5) propose a class of biased estimators called ridge esti- 
mators. Their criterion of goodness is mean squared error. The technique is rela- 
tively easy to use, and it may be shown that the class contains estimators which have 
smaller mean squared error than the least-squares estimator. However, they are 
unable to provide a well-defined and unique choice of estimator from this class. Nor 
have they been able to prove that their suggested procedures actually choose a member 
of the class which achieves smaller mean squared error. In fact, Newhouse and Oman 
(ref. 11) have reported some Monte-Carlo simulation results which indicate that ridge 
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estimators do not in general perform better than least- squares estimators. There is 
the further drawback that the distributional properties of ridge estimators are incom- 
pletely known. 

Marquardt (ref. 7) introduces a class of biased estimators called generalized in- 
verse estimators. Mayer and Willke (ref. 8) discuss a number of classes of biased 
estimators called shrunken estimators. These classes have the property that they con- 
tain members with smaller mean squared error than the least- squares estimator. It is 
not known how to choose such members, however; and there is very little known about 
the distributional properties of these estimators. 

Kendall (ref. 9) and Massy (ref. 10) discuss principal components regression. 
Principal components regression was not introduced as a method of biased estimation, 
but it will be shown to provide biased estimators. The method is very closely related to 
Marquardt 's. Principal components regression was introduced for use when there is 
multicollinearity. 

In this report we consider a method of unifying the treatment of these biased esti- 
mation methods and of unbiased least- squares estimation. The presentation centers on 
a duality of the X X matrix of the normal equations for unbiased least- squares estima- 
tion. The duality is in the sense that the spectral decomposition of X T X into its eigen- 
space representation has the property of describing how well the data points are spread 
out in the data space. A similar decomposition of (X T X) _1 (or a generalized inverse of 
X X if it is singular) has the property of describing how the distribution of the esti- 
mator of b is spread out in the parameter space. We lean heavily on these decomposi- 
tions to discuss the interrelationships of all these estimation methods and to describe 
the consequences of using them. 

The report begins with a brief discussion of the objectives of a linear model analy- 
sis. The following section presents a summary of the standard estimation methods 
applied to linear models when X^X is not of full rank (i.e. , exactly singular) and also 
when X T X is of full rank. The section also discusses the duality of X T X in some 
detail. The next section summarizes the major biased estimators and some of their 
more important properties. In that section we also examine some of the relations among 
the estimators. After that we consider the mean squared error of the estimated regres- 
sion function for each of the biased estimators. This is followed by a discussion of 
hypothesis testing procedures available for each method. The last section presents two 
numerical examples to illustrate the results developed in this report. 

OBJECTIVES OF LINEAR MODEL ANALYSES 

The model that we are dealing with is 

y = Xb + e ( 1 ) 
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where 


y an n x 1 vector of observations 
b a p x 1 vector of unknown parameters 

X an n x p matrix of known values of p predictor variables for each of n observa- 
tions 

2 

e an n x 1 vector of random errors which we assume has N(0, a I) distribution 

In terms of this model we generally wish to consider any one or more of the following: 

(1) Find an estimate of b. The components of b might represent either physical 
constants or perhaps just empirical rates of change. We are usually concerned only 
with point estimates but occasionally desire interval estimates. 

(2) Predict a value of y for some point X 0 ( lxp j of the space of the predictor vari- 
ables. This is often the most important objective. 

(3) Test if certain of the components of b may reasonably be set to zero (or some 
other specified constant). If the parameters represent some physical constants, such 
tests may provide evidence for or against some theory. For purposes of prediction or 
control, if certain components of b can be set to zero, this implies the corresponding 
predictor variables have no effect on y and hence may be ignored. This provides sim- 
plicity and often economy. It also often reduces the variance of the estimated predicting 
equation. 

Most of the previous authors have considered only one of these objectives when de- 
veloping biased estimators. We will consider all three. 


LINEAR MODEL ESTIMATION PROBLEM 

Our model is that described by equation (1). For this model, it is well known that 
either least- squares or maximum- likelihood arguments lead to the minimum- variance 
unbiased estimator for b as any solution b^ to the normal equations 

X T Xb° = X T y ( 2 ) 

In fact, X T X will be either singular or nonsingular; but in practice we may consider 
X T X to be singular, nearly singular, or nonsingular. If X^X is singular, there are 
many solutions b^ to equation (2). The section Models Not of Full Rank describes es- 
timation concepts and procedures for this situation. If X^X is nonsingular, there is 
exactly one solution to equation (2). If X^X is nearly singular, then we might expect 
that there will be difficulty in deciding whether we have a solution or many poorly de- 
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fined solutions. A nearly singular X^X is a symptom of multicollinearity. The sec- 
tion Models of (Just Barely) Full Rank describes estimation concepts and procedures 
for the X X nonsingular and nearly singular situations. The last section describes 
the spectral decomposition of X T X and its interpretation. 


Models Not of Full Rank 

The study of linear models based upon generalized inverses for the X T X singular 
situation has been most lucidly presented by Searle (refs. 12 and 13). We use his nota- 
tion and an example given by Federer (ref. 14), which is discussed in chapter 5 of 
Searle (ref. 13), to review the basics. 

As stated previously, when X^X is singular, there is no unique solution b^ 1 to 
equation (2). We begin by letting G = (X^X) + be a generalized inverse of X^X. That 
is, G satisfies 

G = (X T X) + (3) 


and 


(X T X)G(X T X) = (X T X) 


(4) 


Let 


H = (X T X) + (X T X) 


(5) 


Then 


b° = (X T X) + X T y 

is a solution to equation (2) (not unique) such that the expectation of b° is 

E(b°) = Hb 


( 6 ) 


(7) 


and the variance of b^ is 


V(b°) = G(X T X)G T cr 2 


( 8 ) 


Since G and hence H are not unique, there are infinitely many solutions b° to equa- 
tion (2). An estimable function of the parameter vector b is any linear function of b 
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for which an estimator can be found from b° which is invariant to whatever the choice 
of G. 

Searle (ref. 12) has shown that all estimable functions are of the form 

w T Hb 


for arbitrary choice of w. There are, at most, r linearly independent estimable func- 
tions where r = rank(X T X). The estimator for w T Hb is given by w T GX T y, and this 

T 

is unbiased for w Hb, The variance is 


V(w T G T X T y) = w T GX T XG T wa 2 


The example we consider is discussed in Searle (ref. 13) where weights are pre- 
sented for six rubber plants, three of which are normal, two of which are off- type, and 
one of which is an aberrant. The data are presented in table I. The model we consider 
is 


where 


i] 


= jti + of-^X^ + oi 2^2 "** a 3-^3 ® 



if the plant is of the i th type 
otherwise 


We thus have the model represented as in equation (1), where 


yif 


10l" 




’l 

1 

0 

- 

0 

^12 


105 


>■ 


1 

1 

0 

0 

*13 

*21 

- 

94 

84 

b = 

a i 

“2 

X = 

1 

1 

1 

0 

0 

1 

0 

0 

y 22 


88 


“3 


1 

0 

1 

0 

*31. 


32 

L J 




1 

- 

0 

0 

1 

m 


( 9 ) 
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In this example we have 


(X T X) 


3 2 

3 0 

0 2 
0 0 


T T 1 

&nd it is seen that X X is singular and of rank 3. A generalized inverse of X X is 


G = {X L xf = 


0 

1/3 

0 

0 


0 

0 

1/2 

0 


( 10 ) 


and for this choice of G we have 


H = G(X 1 X) = 


(ID 


Hence, all estimable functions are of the form 


w Hb = (Wj + w 2 + w 3 )m + + w 2 n 2 + WgSg 


( 12 ) 


and there are, at most, three linearly independent choices of w. The unbiased esti- 
mates of w T Hb are given by w T GX T y = w^ + w 2 y 2 + w 3 y 3 . Three reasonable 
choices for independent estimable functions are provided in table II. 

To emphasize, the major point under discussion is that there is no unique solution 
to the normal equations. We could make it unique by imposition of a constraint such as 
a l + a 2 + a 3 = ^ °1 suc h constraints, it is often more useful to concentrate on 

choosing the w. values that lead to meaningful estimable functions. 


Models of (Just Barely) Full Rank 
T 

When X X is nonsingular, the estimation of b and the description of the distribu- 
tion of its estimator are more straightforward. For in this case it is well known that 



is unique and is the minimum -variance unbiased estimator. Also b has the following 
distribution: 

b ~ N '», (7 2 (X T X)' ; ] 

From properties of the multivariate normal distribution, it is well known that a set of 
linear combinations of b, say K T b, has the following distribution: 

K T b ~ n[k T Ij, ct 2 K T (X T X)" 1 k] (13) 

T 

Now, to consider what problems arise as we slowly bridge the gap from X X 
nonsingular to singular, suppose we modify the previous example. We just barely re- 
move it from the singular setting and perform a regression analysis. Suppose we per- 
form an experiment to study the abrasion resistance of rubber as a function of the 
amount of three particular additives. Let X i denote the amount in pounds of the i 
additive which is loaded with an approximately 1000- pound charge to the chemical re- 
actor which produces the rubber. 

The proposed model is 

y = Xb + e 

where 

0. 99 0 if 

1.00 0 (; 

1.01 0 o 

0 0. 99 l 

0 1.01 l 

0 0 *, 

y T ■= [-L y 2> y 3’ y 4> y 5’ y e] 

= [101, 105, 94, 84, 88, 32] 

b T = [jit, a v a 2 , of ;i ] 

and e is defined as previously (following eq. (1)). In this example we have 
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3 2 1 

3.0002 0 0 

0 2.0002 0 

0 0 1 

T 

and it is evident that X X is not singular. Yet, it is nearly so. Let X . denote the 
eigenvalues of X T X. When these data were submitted to NEWRAP (ref. 15), the follow- 
ing eigenvalues were calculated: 

= 8.41888 
X2 = 2.38695 
\ 3 = 1.19444 
x 4 = 0. 00010 

T 

Since X 4 « 0. 0, we see X X is nearly singular. 

The following parameter estimates were obtained 

li = 168. 

a 1 = - 68 . 

“2 = ' 81 - 
a 3 = -136 
T 12 

with a covariance matrix of (X X) a where 

2500.21 -2500.13 -2500.38" 

2500. 38 2499. 96 2500.21 

2499. 96 2500. 38 2500. 13 ( 15 ) 

2500.21 2500. 13 2501.38 

T 

It is evident that X X is formally of rank 4 although it is essentially of rank 3 and that 
the resulting (X X) matrix indicates a large variance in the parameter estimates. 

However, let us recall the estimable functions discussed in the previous section. 
Namely, let us compute 


2500.38 
rp , -2500.21 

< X X > = -2500.13 
-2500.38 
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- a 2 = 13.9530 (14) 
d 2 - d 3 = 54.0278 (54) I 

(2 + 1 (dj + a 2 + <$ 3 ) = 72. 6695 (72 | ) 


( 16 ) 


The numbers in parentheses are the corresponding values from the analysis of variance 
(ANOVA) situation. Allowing for the fact that X was slightly changed to provide non- 
singularity, the agreement is admirable. 

Although the variances of the raw estimates p, <Sj, ag, and d^ are quite large, 
let us consider the variances of the linear combinations of parameters in equation (16). 

T~ 

The linear combinations are defined by K b, where 


"0 

1 

-1 

0 


0 

0 

1 

-1 


1" 

1 

3 

1 

3 

1 

3. 


From equation (13) the covariance matrix of 


K T b 


is given by K T (X T X)' 1 Kor 2 . 


But 


K T (X T X) 


-1 



' 0. 82 

-0.33 

0.05' 

K = 

-0.33 

1.50 

0.17 


. 0.05 

-0. 17 

0. 53_ 


It is thus evident that, even though the full parameter vector is quite ill determined, the 
linear combinations of the parameters corresponding to the estimable functions of the 
ANOVA example are well determined. This is, of course, not at all surprising. 


Duality of X T X 

ryi 

The normal equations matrix X X plays the central role in linear model estimation 
and hypothesis testing. We first note that X T X has a spectral decomposition (ref. 16, 
p. 36) or representation as 
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( 17 ) 


X T X=E Wx' 

1=1 


T 

where X 1 > . . . >L >0 are the eigenvalues of X X and S. are the corresponding 
J - P rp T ^ 

normalized eigenvectors of X X. If r is the rank of X X, then we have a similar 

decomposition (ref. 16, p. 184) of (X T X) + as 


If r = p, then 



(18) 


(19) 


An extremely important point to note is that the spectral representations of equa- 
tions (17) and (18) are not invariant under linear transformations of X. Invariance may 
be attained by assuming that we always consider the linear model in its correlation form. 
That is, we will in the remainder of this report assume that X X has all its diagonal 
elements equal to unity. This may always be achieved by a simple linear transformation 
corresponding to changes in scale and/or location of the original predictor variables. 
Since the correlation matrix is always invariant with respect to changes of location and 
scale, the spectral decomposition of the correlation matrix will be unique and well de- 
fined. Thus, as long as we use the convention of reducing the model to correlation form, 
the representation is invariant to changes in location and scale of the predictor vari- 
ables. 

T 

The spectral decomposition of X X by equation (17) indicates how and how well the 
variables' space is spanned by the experiment. Namely, if ^-1.0 for all i, then in 
a sense the variables' space is perfectly spanned. If >> A p , then the variables' 
space is not well spanned. In fact, XqSj represents the linear subspace (or linear 
combination) of predictor variables which is spanned the best. And X Q S p represents 
the linear combination of variables which is most poorly spanned. In fact, if X p = 0, 
then XgSp is not spanned at all. These considerations are discussed by Kendall and 
Stuart (ref. 17, p. 287). 

In order to illustrate the preceding, consider the following two-dimensional example. 
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Suppose the data points observed are as plotted in figure 1. Assume that the dashed 
line is the line x^ - x 2 = 0 in the variables’ space and that £ 2 * s * ine 
x + x 0 = 0. Assume also that the two extreme points along £ 2 are equally distant 
from - x 2 - 0. 

It is immediately seen that the observations are much more spread out along ^ 

(i.e. , x x - x 2 = 0) than along £ 2 (i.e. , Xj + x 2 = 0). For such a situation we would find 

that A 1 > X 2 ; and we would say that X q S 1 is well spanned, while X q S 2 is poorly 
spanned. 

Turning to consideration of the parameter space, it is well known that the least- 
squares estimator b (under normal distribution theory) follows the normal distribution 
Nr(X T X) + (X T X)b, d 2 (X T X)tl. From reference 13 (p. 185) we have that, for a linear 

L r A J rp^ 

combination of the estimates w b, 

V (w T b) = W T (X T X)+Wff 2 

It can be shown (ref. 16, p. 501) that the choice of w which minimizes the variance of 
w T b is w = and that this variance is 

v(si&) = S^xbo+SjU 2 =^- 

X 1 

T- 

The choice of w which maximizes the variance of w b is w = and 

( T'\ rr^ 

V ( b ) =— (assuming p = r) 

v p / Xp 

Thus, S^b describes the most determined linear combination of the parameters, while 

S^b describes the least determined. In fact, if X = 0, then S^b is nonestimable and - 
p P P 

hence not determined at all. An interpretation of this is that S^b has infinite variance. 

r 

SOME BIASED ESTIMATORS 

We now describe briefly some of the biased estimators that have been proposed and 
some of their most important properties. 
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Ridge Estimators 


There are two forms of ridge estimator proposed by Hoerl and Kennard (ref. 5). 
One is a general form and the other is a form more useful for applications. 

In the general form, we begin with the model of equation (1) 

y = Xb + e 

We may represent X^X as X^X = PAP^, where P is the orthogonal matrix whose 
columns are the normalized eigenvectors of X T X and A is the diagonal matrix of 
eigenvalues. Considering the transformation to new predictor variables defined by 

W = XP 


and the model 


y = Wa + e 


we have 

a = P T b 
W T W = A 
a T a x b T b 

The general ridge estimation procedure is defined by the 'family of estimators indexed 
by the parameters ^ 0 


a* = (W T W + K)" 1 W T y (20) 

where the matrix K is defined by 

K » (Ojjty 

When all k. = 0, a* is the ordinary least- squares estimator and is unbiased. When any 
k i > 0, the resulting estimator for a is biased. We define the mean squared error of 

a* as 


M(K) = E [(a* - a) T (a* - a)] 
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It may be shown (see Hoerl and Kennard, ref. 5) that the choice of k. = cr /a. will 
minimize M(K) among the class of estimators defined by equation (20). Unfortunately, 
in order to utilize this optimal choice of k^, one must know both cr and a i . But if 
this information is available, there is no estimation problem. It may also be noted that 
in this canonical representation, the matrix (W T W + K) is diagonal. Thus, the estima- 
tion of a reduces to the independent estimation of the components of a. 

In the preceding form of ridge regression there are p k T s to choose. In order to 
provide a reasonably tractable method of analysis, Hoerl and Kennard consider also the 
model 


y = Xb + e 

where it is assumed only that the X matrix is scaled such that X T X has diagonal ele- 
ments equal to unity. That is, they consider the model in its correlation form. The 
more specific form of ridge regression is then defined by the family of estimators 

b(k) = (X T X + kl)~ 1 X T y kM (21) 

Hoerl and Kennard have shown that this family has the property that there always 
exists a k > 0 such that 

M(k) = E j[b(k) - b] T [b(k) - b]J 

= o 2 \ — h. + k 2 b T (X T X + kl)' 2 b 

/ i (\ + k) 2 

i 

is minimized. It may be shown also that M'(0) < 0. There is no way yet developed for 
determining the ’’best” k. Hoerl and Kennard r s suggested procedure for choosing an 
estimator from this class is to plot the elements of b(k) for a number of values of k 
between 0 and 1. The "best" value of k is then subjectively chosen as the point at 
which these curves begin to stabilize. They have not shown that this procedure even 
guarantees a reduction in M(k) let alone a minimum. 

In what follows, we will be interested in the quantities S?b(k) as indicated in the 

T * 

section Duality of X X. We obtain immediately 
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S?b(k) = S?(X T X + kI) _1 X T y 




sHx T y 


— — S T X T V 
X, + k b i x y 


v i 


X. + k 


T T 

where v. = S i X y. We also have 


S?b(k)] = — - — S?'x T E(y) 


A. + k 


( 22 ) 


‘‘ sFb 


Xj + k 1 


( 23 ) 


and 


V 



S?X T (a 2 I)XS. 

rt X 1 


(Xj + k)' 


V 


(Xj + k)‘ 


( 24 ) 


T- 

Thus, for any nonzero k, b(k) is the least biased linear combination of the estimator 
and Spbfk) is the most biased. Also S^b{k) has the least reduced variance, and S T b(k) 
has the most reduced variance. Thus, the best determined linear combinations of the 
parameter estimates are the least modified, while the least determined are the most 
modified. In effect, as k increases, those S i b(k) corresponding to small A. are 
rapidly driven to zero. The estimators are thus constrained estimators. 
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Marquardt’s Generalized Inverse Estimators 


Marquardt (ref. 7) discusses a method of applying generalized inverses to biased 
estimation. He also considers some relations among these estimators, ridge estimators, 
and nonlinear estimation. He considers the model of equation (1) where the X matrix 
has been scaled so that X T X is in the correlation form. His family of estimators is 
indexed by a parameter p where 0 ^ p ^ p. The family is defined by 

b(p) = (X T X)+X T y (25) 


The matrix (X T X)£ is defined as follows: Let p* = [p] denote the greatest integer in 
p and dp = p - p* . Then (X T X)t is defined as 

b* 


(X T X)+ = 


p* 

3=1 


— S-S T + 
3 3 


dp 
L p* + 1 


S p*+l S p*+l 


= Gp ( 26 ) 

As the notation is meant to indicate, (X^X)* is closely related to a generalized in- 
verse of X T X. In fact, if r = rank(X T X), then (X T X)^ is the Moore- Penrose gener- 
alized inverse of X T X and is unique. An important point to note is that it is well known 
that the Moore- Penrose generalized inverse yields the minimum-norm solution to the 
normal equations (ref. 18, p. 50). 

Marquardt’s estimators thus provide a sort of minimum -norm solution to the normal 
equations. He pursues this idea in some depth in reference 7. He also shows that there 
always exists a 0 < p < p such that 


M(p) = e([6(p) - b] T [b(p) - b]} 

is minimized. It is also shown that M'(p) > 0 so that the mean squared error of b(p) 
is initially decreasing as p decreases from p. As with the ridge estimators, there is 
no way yet developed for determining the ’’best” p. 

With X scaled so that X T X is in the correlation form, Marquardt labels the 
diagonal elements of (X^X)^ as variance inflation factors. His suggested analytical 
procedure is to consider several estimates b(p) for p between p and 0. He suggests 
the rule of thumb that an acceptable value of p is one such that the maximum variance 
inflation factor should usually be larger than 1.0 but certainly not as large as 10. 0. 
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Marquardt has not been able to show that this procedure even guarantees a reduction in 
M(p) let alone a minimum. 

For these estimators we have 


C. 


0 p < i - 1 


S ?b(p) = sT (X T X)^X T y =^i 


S i T X T y i - l<p<i 


(27) 


— sJx T y i < p 

X. 1 

l 1 


It may rather easily be shown that 


r 


0 p < i - 1 

E[s?b(p)] = J (dp)s7b i - 1 < p s i 


(28) 


S. b i < p 




and 


f5 Mi-l 


(dp)V 


v[s^b(p)] = J \ 


i - 1 < p < i 


— i <P 




(29) 


From equations (28) and (29) we see the following behavior as p decreases from 
p = p: The S?b(p) are successively set to zero in order of increasing Xy The best 
determined linear combinations of the parameter estimates are the last to be set to zero, 
while the least determined are the first to be set to zero. 
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Shrunken Estimators 


Mayer and Willke (ref. 8) discuss several families of biased estimators which may 
all be labeled shrunken estimators. They consider the model of equation (1), y = Xb + e, 
but do not require that X T X be in correlation form. Each family of estimators is 
indexed by a parameter 1 > c > 0 and defined by 

b(c) = c(X T X)"*X T y = cb (30) 


where b is the ordinary unbiased least- squares estimator. If the constant c is a 
scalar fixed in advance of the analysis, then b(c) is called a deterministically shrunken 
estimator. If c = f(b T b) is a scalar function of the least- squares estimator, then b(c) 
is called a stochastically shrunken estimator. 

It may be shown that there does exist a value of 0 < c < 1 such that 

M(c) = E j[b(c) - b] T [b(c) - b]j 

A 

is minimized. Consider the stochastically shrunken estimator b(c), where 


and 


^[wsW 1 ] ' 

s 2 = y T y - b T (X T X)“ 1 6 

p > 3 > 

0 < £ < 2(p - 2)(n - p + 2)"^ 


This estimator is one discussed by Sclove (ref. 4). Then if we define 

W[b(c)] = E[[b(c) - b] T [b(c) - b]j 
it was shown by Sclove (based on results of refs. 2 and 3) that 


(31) 


k 



P - 2 

n - p + 2 
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minimizes W[b(c)]. This is the only biased estimator known to this author for which 
a choice of biased estimator can be explicitly given that guarantees a reduction in mean 
squared error. 

Mayer and Willke develop another family of stochastically shrunken estimators 
indexed by 5^0 


b(5) = 5bb T (I + 5bb T ) _1 b . 

Few properties of this estimator are known. Mayer and Willke suggest the user plot the 
components of b(6) as a function of 6 and choose that 5 where the curves begin to 
stabilize. It is not known whether this procedure provides an estimator with smaller 
mean squared error than b. 

For purposes of comparison to the other biased estimation procedures we will con- 
sider only the deterministically shrunken estimator b(c) = cb. For this choice we have 

sjb(c) = csTb = -1 sTx T y ( 32 ) 

\ . 

1 

Ejstbfc)] = cstb (33) 

and 

v[s7b(c)] = 5^! ( 34 ) 

1 

From equations (33) and (34) we observe that, unlike the ridge and generalized inverse 
estimators, all linear combinations of the parameter estimates are driven toward zero 
proportionately and that the variances are also proportionately reduced. 


Principal Components Estimators 

Principal components regression (or canonical regression as it is sometimes 
called) does not appear to be widely used in the physical sciences although it is appar- 
ently used in economics and the social sciences. The description of principal compo- 
nents regression will be primarily along the lines of Kendall (ref. 9) and Massy (ref. 10) 
although there may be some differences. 
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We begin with the model of equation (1), y - Xb + e, and make an orthogonal trans- 
formation of this model to that of 


y = Wa + e 


where 


X T X = PAP T 
P T P = PP T = I 
W =XP 


and 


a = P T b 

The vector of parameters a is then estimated by 

a = (W T W) _1 W T y 

= A " 1 W T y (35) 

Since A is the diagonal matrix of eigenvalues of X T X, each component of a is inde- 
pendent. This is the principal reason for the transformation. Massy discusses two 
methods of obtaining estimators for b from this form of the model. The first method 
consists of setting to zero the components of a which correspond to "small" eigen- 
values. Let a(d) denote the resulting estimate of a. (That is, the components of a(d) 
are equal to zero for those components with small X^ and are equal to the corresponding 
components of a for those with large X. . ) We then define the estimator for b as 

6(d) = Pa(d) 

It should be noted that this is precisely Marquardt’s generalized inverse estimator when 
p is an integer. Kendall also uses this method. We will not refer to this method as 
principal components estimation. 

The second method discussed by Massy is to test the components of a for signifi- 
cance from zero. This may be done, for instance, by the usual t-tests. Other methods 
are discussed by Kennedy and Bancroft (ref. 19) and Holms (ref. 20). In this case we 
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will obtain an estimator for b which we will still denote as b(d) and define as 

b(d) = Pa(d) (36) 

For either of these methods we obtain 


s?b(d) = s^patd) = 1 


r T7 s I xT y 


a ; (d) * 0 


0 a,(d) = 0 


v: 


(37) 


depending upon whether the i 1 component of a(d) is nonzero or zero. The expecta- 
tions and variances of these quantities are dependent upon the method for choosing a(d). 
In particular, they will be the same as for Marquardt's estimators if Massy's first 
method is used. If his second method is used, these quantities would be difficult to 
obtain. 

It should be noted that obtaining a(d) by the second method will usually be on good 
statistical footing because the independence of the components of a permits easier 
analysis than the nonindependent situations more commonly found in regression. 


Discussion 

There are five points we shall touch upon in this section: 

(1) In point estimation, consideration should be given to the estimable functions of 
the parameters. 

(2) Biased point estimators all place some form of constraint upon the parameter 
space . 

(3) The decision to use mean squared error as a criterion for the goodness of esti- 
mators should be based on the objectives of the data analysis and be made independent 
of the condition of X^X . 

(4) If mean squared error is to be the criterion of goodness, then Sclove's estimator 
is the only one which has any proven optimality properties. 

(5) Because of the lack of distributional information, no biased estimator provides 
interval estimation capability. 

We now consider these points in more detail. 

Estimability - As the section LINEAR MODEL ESTIMATION PROBLEM pointed 
out, when the X X matrix is precisely singular there is no unique estimator for b. 
Instead we should take the approach of examining estimable functions of the parameters. 
The major problem in application would be that of choosing meaningful estimable func- 
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tions. There seems to be no good reason for abandoning this approach as soon as X X 
becomes the least bit nonsingular. What seems most reasonable is that less and less 
attention need be paid to estimability considerations as X^X deviates more and more 
from singularity. The transition of viewpoint should be smooth rather than a quantum 

leap. 

Constraints . - Each method of biased estimation (including subset regression - 
which is also, of course, a form of biased estimation) introduces constraints on the 
parameter space. In the analysis of variance, the model is typically over parameter- 
ized. But it is over parameterized in such a manner that certain constraints are 
"natural. " In the regression situation, "natural" constraints seem unlikely to present 

themselves. 

The constraints imposed by biased estimators are as follows: The generalized in- 
verse estimators of Marquardt and the principal components estimators of both methods 
drop linear subspaces out of the parameter space. The subspaces are defined by the 
constraints that sjb(p) = 0 or S?b(d) = 0 (eqs. (27) and (37)). Which particular sub- 
spaces are dropped out, of course, depends upon how it is decided to drop them. We 
then choose minimum- norm solutions in the subspaces remaining. Ridge estimators 
have a very closely related behavior. From the expressions for b(k) (eq. (22)), it 
may be seen that for the linear combinations corresponding to large A i the addition of 
k to the denominator has a lesser effect than for those with small A_ Each combina- 
tion, however, is driven to zero, with those with small A^ getting there quicker. The 
shrunken estimators are more difficult to characterize, but there are constraints none- 
theless. 

The final comment along these lines might be that ordinary subset regression tech- 
niques also provide biased estimators and also impose constraints. The constraints are 
that certain components of b are set to zero and as such have an extremely clear in- 
terpretation. 

Condition of X T X . - Following the comments about estimability and constraints, it 
seems that biased estimation is not necessarily a good means of removing the symptoms 
of multicollinearity. It has been belabored in references 5 to 8 that one consequence of 
multicollinearity is that the estimators tend to be "too large. " Since the least-squares 
estimator is unbiased, it should also possess to some degree the tendency for estimators 
to be "too small. " In fact, much of the confusion can probably be attributed to misin- 
terpretations of the type found on page 58 of Hoerl and Kennard (ref. 5). They state, 
"However, the relationships in section 2 show that on the average, the distance from 
(b) to (b) will tend to be large if there is a small eigenvalue of X^X. In particular, the 
worse the conditioning of X^X, the more b can be expected to be too long. " The first 
Statement is not correct since unbiasedness implies the average distance is zero. 
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Thus, mean squared error has been proposed as a criterion of goodness when X X 
is ill conditioned. But it is either a good criterion or a bad criterion independent of 
X T X. 

Sclove T s estimator . - Of all the biased estimators, only Sclove’s can claim a guar- 
anteed improvement over the least- squares estimator. Principal components estima- 
tion based on significance testing should offer possibilities for improvement. This is 
only conjecture at this point. Ridge estimation, generalized inverse estimation, and 
shrunken estimation do not show how to choose better estimators. Their proponents 
show only that better ones exist. For Bayesian statisticians, there are comments in 

references 5 and 7 which indicate these might be useful estimators. This is so because 

2 

a Bayesian statistician assumes the user is able to specify prior information about a 
and b. 

Interval estimation . - Until such time as the distributional properties of the biased 
estimators are discovered, they cannot be used to provide interval estimators. Only 
least-squares estimation without subset regression provides the necessary distribution 
theory. 


MEAN SQUARED ERROR OF ESTIMATED REGRESSION FUNCTION 

One of the primary objectives of a linear model analysis is to provide a predictive 
equation. Since the biased estimators discussed in the previous section all show that 
there exist members in their respective classes with smaller mean squared error, it 
seems likely that their use could also provide predictive equations with smaller mean 
squared error. It will be seen that, in fact, this can be shown to be true for most of 
the biased estimators. In order to provide a common reference point, we also present 
the mean squared error of the least- squares predictive equation. Since this latter pre- 
dictor is unbiased, the mean squared error reduces to the variance. 


Least Squares 

For any estimator b* of b, we will consider the use of b* to predict the esti- 
mated regression function at a set of values of the predictor variables denoted by Xq 
( assumed to be a 1 x p vector). We recall the model 


y = Xb + e 

For an estimator b* based on this model and data, the predicted regression function 
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at Xq is 

y 0 = x 0 b* 

and the mean squared error of this predicted regression is denoted by 

M P (y 0 l 6 *) = E [(V x o b > T (V x o b >] 

= E je T X(X T X)" 1 X T XqX 0 X(X T X)" 1 X T eJ + 0 


= y 1 (b*) + r 2 < b * ) <38) 

where y^(b* ) corresponds to the variance and ygfa* ) t0 the b * as S( l uare d- For the 
least- squares estimator, b* = b is unbiased and consequently y^ = Xq 6 is unbiased 
for Xgb. Thus, 

b = (xVxTy = b + (X T X) _1 X T e 


and 


M p(y 0 |b) = E[(y 0 - X 0 b) T (y 0 - X 0 b)] 
= X 0 (X T X)‘ 1 X 0 a 2 


(39) 


(see ref. 1, p. 56). 


Ridge Estimators 

For the ridge estimators we recall that 

b(k) = (X T X + kI)' 1 X T y 


Thus, 


y Q = X Q b(k) = X Q (X T X + kI)“ 1 X T (Xb + e) 
= X Q Zb + X Q (X T X + kl)” X X T e 
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where Z = (X T X + kI) _1 X T X. Thus, 

Mp[y 0 |b«] ■ e K - x o b > T <yo - x o b )] 

= e[[x 0 (X T X + kI) _1 X T e + X 0 (Z - I)b] T [x 0 (X T X + kI) _1 X T e + X Q (Z - I)b]} 

= E je T X(X T X + kI) _1 XpX 0 (X T X + kI)* 1 X T e]+ b T (Z - I)xJx Q (Z - I)b 
= y 1 [b(k)] + r 2 [b(k)] 

Theorem 1: The variance function y^[b(k)] is a monotonically decreasing function 
of k and y f [b(0)] < 0. 

Proof: Note that if we assume e ~ N(0, cr^I), then y^ is the expectation of a quadratic 
form in e. Thus (ref. 13, p. 55), 

yj[b(k)] = CT 2 tr[x(X T X + kl)' Jx o (X T X + kI)" 1 X T ] 


Note that (X T X + kl)~ 1 
and hence 



X(X T X + klj^Xg = X"' — i— xs^xj 

i 


which is n x 1. Thus, 
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yjfrOO] 


= tr 



^ xs A T x » 



— 5 — x n s i s?'x T ' 

X. + k U ] J 


V y tr r(xS.sj r xJ)(x 0 S i S?’x T ) 

LuLu (x i + k)(x j + k) LV 3 ] oyv 0 1 1 i. 


i ] 


\ V I (x 0 s s T x T )(xs 1 s T xJ) 

/ ./ , (X i + k)(X + k)V On /v ] J 0/ 


1 ] 


II 

i j 


(A. + k)(\j + k) 


X, 


vT ( E 

\ m 


a s s T ]s.s'! r 

m mm 3 j 


I 


W?xJ 


— ' (A. + k)‘ 
i 1 


Note that 

y x [b(o)] = (^Xg^xj^xJ 

VjfbtO)] = 0 


and 


y^[b(k)] = -a 2 V' — XqSjST’xJ < 0 

(X + k) 3 

1 1 

Thus, yj is a monotonically decreasing function of k, as was to be shown. 

Theorem 2: The bias function r 2 [b(k)] satisfies y 2 [b(0)] = 0 and y£[b(0)] = 0. 
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Proof: Recall y 2 [b(k)] = b T (Z - I)xJx 0 (Z - I)b. Since 

Z - I = (X T X + kI) _1 X T X - I 



s.s? 

11 


we obtain 


y 2 [b(k)] 


y / + k )^j + k ) 

i 3 


Let 


Then 


f tj (k) = 


(Xj + k)(X. + k) 


*ij« 


k 2 (X. + X.) + 2kX,X. 
J LI => o 

(L + k) 2 (Xj + k) 2 


Thus, it is easily seen that r2[b(0)] = 0 and y^[b(0)] = 0. 

Theorem 3: Mp^yQ|b(k)j is initially decreasing in k. 

Proof: The result follows directly from theorems 1 and 2. 

From theorem 3 we thus have the result that it is theoretically possible to reduce 
the mean squared error by slightly increasing k. We have the same drawbacks as for 
ridge estimators. Namely^ how to specify better estimators is unknown; and they will 
typically be functions of a and b, which are assumed to be unknown. 


Marquardt's Generalized Inverse Estimators 
For these estimators we recall that 
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and 


b(p) = G p X T y = (X T X) p X T y 


H P ■ G p xTx -Z s j s j T + ^ s p* + i s J* + i 

3=1 


Then since 


y = Xb + e 

y Q = X Q b(p) = X 0 G p X T y = X 0 G p X T (Xb + e) = X Q H p b + X 0 G p X T e (40) 

The expected mean squared error of the predicted regression function is 
M p[y 0 |b(p)] = E[(y 0 - X 0 b) T (y 0 - X Q b)] 

» E {[ X 0 G p X 0 e + X 0< H p - I)b] T [x 0 G p xT e + X 0 (H p - I)b]j 
= E(e T XG p X T X 0 G p X T e) + b T (H p - I)xJx Q (H p - I)b 
= r^bfp)] + r 2 [Wp)] 

Theorem 4: The variance function y^[b(p)] is a monotonically increasing function of 
p and yl[b(p)]> 0. 

2 

Proof: Since we assume e ~ N(0, a I), we have that 

V^bfp)] = cr 2 tr(xG p x’’x 0 G p X T ) 

From equation (26) for G we may show 

r 


p * 

XG P X 0 - V 7 xs i s i Tx o + ™ P * + 1 S P* + : 

Z — J A i p* + 1 
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rT , 


and XGqXq is an n x 1 vector. Thus, 


l r i]xG,x;;'i(X I) G (J X T ) , (y., l C lj K')(x<^j - 2^ ^ i X 0 S,s’'(x' r X | 3 ]S ; r z; 


yfw; 


1 3 


T X T xs p * +1 sJ* +1 x 0+ ) f x 0 s p * +1 sj, +1 x T xs j S- r x 0 


It 

) 


I (^p) V q qT yTyn qT Y 

. x O b p*+l b p*+l x Xb p*+l b p*+l X 0 


7)*+l 


Hence, 


rx[b(p)] = 


p* 

ill 1 


T + (gjPT X q qT 

n . x 0 b p*+l b p*+l x 0 

V+i 


and 


y^MO)] = 0 

This function is continuous from p = 0 to p = p, and the derivative is continuous for 
nonintegral p. Between the two integers p* and p* + 1 we have 

ri [b(p)] = ^> X 0 s p * +1 sj, +1 x£ > 0 

v+1 

Theorem 5: The bias function r 2 [b(p)] satisfies y 2 [b(p)] = 0 and y£[b(p)] = 0. 
Proof: We have 


y 2 [b(p)] = b T (H p - I)x£x Q (H p - I)b 


(41) 
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where 


V I = - f, S i S j r + (dp - DV +1 S P* + 1 

j=P* +2 

and the sum is null if p* + 2 > p. Thus, equation (41) may be expressed as 

r 2 [b(e)> £ £ b T s i sTxJVj s ] T b - f) (dp-i)b T s p * + iSj* + 1 xJx 0 SiST b 


i=p*+2 j=p*+2 


i=p* +2 


£ (dp- D^SjfxJx^^sT + 1„ 
j=P* +2 


+ (dp - D 2 b T S p * +1 sJ* +1 xT Xo S p , +1 sT +1 b 


From the preceding, it may be verified that, since dp approaches 1 linearly with 
p, YgS 1 (p)] = u P on differentiating with respect to dp, we also may verify that 
y^[b(p)] = 0. 

Theorem 6: The mean square error function Mp|yo|b(p)J is initially decreasing as 
p decreases from p = p. 

Proof: Immediate from theorems 4 and 5. 

Theorem 6 indicates that it is possible to improve the mean squared error of the 
predicted regression, at least initially, by decreasing p. However, we are unable to 
specify how much to decrease p. Also, optimal values of p (if they exist) would be 
expected to be functions of a and b, which are assumed to be unknown. 


Shrunken Estimators 

For any of the shrunken estimators, we have that 

b(c) = cb 0 s c 5 1 


and hence 


y 0 = cX 0 b = cX 0 (X T X)' 1 X T (Xb + e) 
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Thus, 


Mp[y 0 |b( C )] = E[(y 0 - X 0 b) T (y 0 - X Q b) 

= e|(c - l)X Q b + cX 0 (X T X)' 1 X T e] T [(c - l)X Q b + eX () (X T X)' 1 X T e]| 

= E[c 2 e T X(X T X)' 1 xJx 0 (X T X)' 1 X T e] + 2e[(c - l)cb T xJx 0 (X T X)' 1 X T e 

+ e[(c - l) 2 b T xJx 0 b] 

For stochastically shrunken estimators, these expectations may be somewhat difficult. 
For a deterministically shrunken estimator, c is a constant and M [b(c)] is easily 
found to be 

M p [y 0 |b( c )] = c 2 E[e T X(X T X)- 1 X^X 0 (X T X)' 1 X T e] + (c - l) 2 b T xJx 0 b 
= r 1 [b(c)] + r 2 [b(c)] 

Theorem 7: The variance function y^[b(c)] is a monotonically increasing function of 
c > 0 and yj[b(l)J 0. 

Proof: Since y^ is simply a positive constant times c^, it is immediately seen that 
y^ satisfies the stated conditions. 

Theorem 8: The bias function ygfbfc)] is a monotonically decreasing function of 
c for 0 < c :< 1. 

o 

Proof: Since yg is simply a positive constant times (c - 1) , the result follows 
immediately. 

Theorem 9: MpJyplMc) is initially decreasing as c decreases from c = 1, and 

there is a unique minimum for some 0 < c < 1. 

Proof: Immediate from theorems 7 and 8. 

Theorem 9 states that an optimal choice of c exists. However, this optimal value 

2 

of c will be a function of a and b. It should be noted that the stochastically shrunken 
estimator given by equation (31) (i.e. , the estimator discussed by Sclove) seems to be 
the most likely one for which some optimality property of the predicted regression 
function will be achieved. 
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Principal Components Estimators 

The principal components estimators of the second type depend upon the particular 
method used for determining the significance of the coefficients. We will not consider 
the mean squared error of the predicted regression function in this report. Kennedy 
and Bancroft (ref. 19) and Holms (ref. 20) have considered two different procedures for 
subset regression in the principal components case. Both references are concerned 
with the prediction problem. There is no clear way of comparing their results to our 
results since most of their results are based on Monte- Carlo simulation studies. 


Discussion 

From the preceding developments we can draw the conclusion that all the biased 
estimators discussed offer the possibility of decreased mean squared error of the esti- 
mated regression function. This possibility can be realized only in the event that we 
can identify particular better members of each class. At the current state of the art, 
there is no way known to do this. The two most promising possibilities seem to be 
principal components and Sc love's shrunken estimator. Principal components is appeal- 
ing because of some Monte- Carlo simulation work reported in references 19 and 20. 
Sclove’s estimator is promising in the sense that the estimator for the regression coef- 
ficients can be proven to have smaller mean squared error than the least- squares esti- 
mator. 


HYPOTHESIS TESTING 

The third major objective of a linear model analysis is to determine if some of the 
parameters in the model can reasonably be set to zero. To do this, the distributional 
properties of the estimators must be known in order to perform significance tests . For 
the ridge, generalized inverse, and shrunken estimators, this information is not avail- 
able. Thus, hypothesis testing is not possible. For the principal components esti- 
mators, distributional properties of the estimators of the transformed model are better 
known. However, an experimenter is more interested in hypothesis testing in terms of 
the original parameterization. And it is most unlikely that a subset regression in the 
transformed model will also correspond to a subset regression in the original parameter- 
ization. 

Ordinary least squares seems to be the only method available for subset regression 
in the original parameterization. Simultaneous deletion of variables (and parameters) 
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based on confidence ellipsoids (for parameters) would be a procedure with a good statis- 
tical basis. The more usual subset regression procedures based on stepwise regression, 
or some such, have the drawback that they involve nonindependent repeated significance 
tests. Theoretical developments are difficult, but there has been much work in the area 
based on simulations. To this author, the usual least-squares subset regression pro- 
cedures are the most appealing. 

EXAMPLES 
Example 1 

We use the example studied in some detail by Marquardt (ref. 7). The linear re- 
gression model is 

E(y) = b-jXj +b 2 x 2 

and the estimated model is 

y =b*x x + b2|x 2 

where the b* are to be estimated by several methods for comparison. The observed 
data are 

3/5 y 1/2 4/5 ^1/2 

X= 4/5 Vv/2 3/5 yT/2 

5/5 >7/2 5/5^172 

T 

y = 2 

.3. 

The eigenvalues and associated normalized eigenvectors of X T X and (X T X) -1 are 
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X T X : 



1.98 

0.02 


(X T X) _1 



0.505 

50.00 


s[ = (V 2/2, y'i /2) 

s 2 “ (V2/2,-^2/2) 

Sj = (Vi/2, ^2/2) 

sj = (V2/2,-V2/2) 


T 

It is clear from these eigenspace descriptions that X X has the formal rank of 2 but is 
essentially of rank 1. In the parameter space, the variance of the linear function 
^/2(b 1 + b 2 ) is about 1 percent of the total variance, whereas yi/2(bj - b 2 ) has about 
9$ percent of the total variance. This is equivalent to saying that ^2/2(b 1 + b 2 ) is well 
determined, while ^2/2 - b 2 ) is not. We drop the factors ^2/2 and thus consider 
only + b 2 and bj - b 2 for simplicity. 

Marquardt performed ridge regressions for various values of k, generalized in- 
verse regressions for various values of p, and the two best subset regressions. Some 
of the results of these analyses are presented in table in. 

Table m(a) presents the ridge regression results for several values of k given in 
the first column. The next two columns give the individual estimates of and b 2 « 

The following columns give the estimates for b^ + b 2 and - b 2 . The values of k, 
bp and b 2 are an abbreviated set of results taken directly from Marquardt. The es- 
timates + $ 2 and bj - b 2 are not given by Marquardt. It may be noted that b^ - t> 2 
is much more rapidly decreasing as k increases than is b^ + b 2 . This is in accord 
with theory (eq. (22)). 

Table m(b) is more interesting inasmuch as generalized inverse regression has a 
more direct relation to estimable function concepts than does ridge regression. The 
format of the table is identical to that of table IQ (a). It is interesting to note that for 
1 p rs 2, bj + b 2 is unchanged while bj - b^ decreases linearly in p to the value 
zero for p = 1. 0. Then for 0 :£ p 2 s 1, b^ - b 2 remains at zero, while bj + t> 2 de- 
creases linearly to zero as p decreases. This is in accord with theory (eq. (27)). 

Table QI(c) presents the results for the two subset regressions using only or 
x 2 only. The comparison of b^ + $ 2 and bj - b 2 resulting from these two regres- 
sions to the corresponding values from the full regression is most instructive. Note 
that the eigenspace analysis indicates bj + $ 2 should be well estimated and has an es- 
timated value of 3. 6427 from the full regression. The corresponding values from the 
subset regressions are both quite close. But the two values for bj - b 2 are not close 
and in fact the x^-only regression yields a value much closer to the full regression 
value. Based upon this and the comparison between the two residual sums of squares 
the Xj-only subset regression is clearly superior. 
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It should also be noted that Marquardt chooses the ridge estimator with k = 0. 2 as 
the best ridge estimator. This yields a residual sum of squares of 0. 887. He also 
chooses the generalized inverse estimator with p = 1 as the best of that class of esti- 
mators. This yields a residual sum of squares of 0.864. The x^-only subset regres- 
sion, however, provides a residual sum of squares of 0. 480. The x^- only, subset re- 
gression also comes within e of not violating the stipulation that both regression 
parameters are expected, because of physical considerations, to be positive. 

We would consider the x-j-only subset regression to be superior to either the ridge 
or generalized inverse estimators since it fits the data better and has a more readily 
interpretable constraint. 


Example 2 

For our second example we consider the data of Gorman and Toman (ref. 21), which 
were used as an illustrative example by Hoerl and Kennard (ref. 5). The data are pre- 
sented in table IV in correlation form. Table V presents the eigenvalues and eigenvec- 
tors of X T X. Although we used the two-digit correlations in all calculations just as 
Hoerl and Kennard did, slightly different eigenvalues were obtained. Our calculations 
were done by using the double-precision version of the IBM SSP EIGEN subroutine 
(ref. 22) on an IBM 360/67. 

Using the data matrix of table IV, we performed ridge regressions for several 
values of k and generalized inverse regressions for several integral values of p. 

These results are presented in tables VI and VH, respectively. 

Table VI provides the following information about the ridge regression results: The 
first column indicates the various values of k for which ridge regressions were calcu- 
lated. For these values of k, the coefficient vectors b(k) were calculated. These are 
not provided in the table. Instead, the values of S?b(k) were calculated. These values 
are provided in the next 10 columns, for i = 1 to 10. The last column gives the re- 
sidual sum of squares for each value of k. Upon examination of the table, it may 
readily be seen that S?U.(k) decreases as k increases. The rate of decrease is most 
rapid for i = 10 and least rapid for i = 1. This is in accord with the theory previously 
discussed (eq. (22)). The bottom row of the table presents the estimator b(k = 0. 25) 
which Hoerl and Kennard chose as their "optimum" ridge estimator. 

Table Vn provides the following information about the generalized inverse regres- 
sions: The first column provides the values of p for which generalized inverse re- 
gressions were performed. For these values of p, the coefficients b.(p) were calcu- 
lated, and these are presented in the next 10 columns, for i = 1 to 10. The last column 
provides the residual sum of squares for each value of p. The bottom row of the table 
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provides the values of S?b(p = 10). As indicated previously, these represent what 
might be called the estimable functions. The effect of changing p is to leave S?b(p) 
unchanged for i ^ p and to set sTb(p) = 0 for i > p. We thus need only to present 
the S?b(p = 10) values for comparison purposes. 

Generalized inverse estimators have not previously been applied to these data in the 
literature. For comparison purposes we will simply choose b(p = 9) as the ’ 'optimal” 
generalized inverse estimator since it provides very nearly the same residual sum of 
squares as the ’’optimal” ridge estimator. This choice also provides a maximum vari 
ance inflation factor of 3.85. 

It may be noted that there is no consistent behavior of the individual components of 
b(p) as p decreases from p = 10 to p = 6. That there is a consistency in b{p) as a 
whole, though, is evidenced by the invariance of the sFbifi) values. 

When Gorman and Toman studied these data, they arrived at a best subset regres- 
sion in which variables 1, 4, 9, and 10 were deleted. We will not consider heie that 
there might be a better subset regression (as indeed there might be). The individual 
parameter estimates for the chosen subset are presented in table VIII. Also given is 
the residual sum of squares for the subset regression. For purposes of comparison, 
let b(d) denote the estimate of b where we have deleted variables 1, 4, 9, and 10^ In 
effect we have set b^ = b^ = bg = b^Q = 0. The table also provides the values of b(d) 
and S?b for comparison purposes. 

We consider three criteria to compare the three ’’optimal” estimators. The first 
criterion is the residual sum of squares value. The second criterion is defined as 
follows : Let 


L T (b* ) = (S^b* , . . . , S^b*) 


where b* is some estimator of b. Now compute the variance -weighted squared dis- 
tance of each of L[b(k)], L[b (p)], and L[b(d)] from L(b). That is, let 






and 
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d d = f, \[ s ^( d ) - s I^f 
1 = 1 


The third criterion is difficult to quantify. It is the number and type of constraint placed 
upon the parameter space to obtain the estimator. For instance, b places no con- 
straints upon the parameter space. The estimator b(d) places certain restraints upon 
the parameter space by restricting certain components of b to zero. The generalized 
inverse estimators impose that certain linear combinations of b be set to zero. They 
also impose the restraint that b(p) be a minimum-norm solution to the modified normal 
equations. The ridge estimators impose the constraint that b(k) be the minimum-norm 
estimator in a class of estimators defined by hyper ellipsoids in the parameter space. 

Table IX presents the three criteria descriptions for the "optimal" members of the 
ridge, generalized inverse, and subset regression estimators. The comparison of 
these criteria cannot be entirely objective. On the basis of residual sums of squares 
and the constraints, it would seem that the subset regression is best. The residual sum 
of squares is lowest, and there are four constraints which are quite easy to interpret. 
The subset regression, however, has a larger d. This indicates that dropping vari- 
ables 1, 4, 9, and 10 has affected the estimable functions more. One interpretation of 
this larger distance is that the estimator b(d) is farther from the center of the hyper- 
ellipsoids which define confidence regions for b based on the full least- squares solu- 
tion. A closer subset regression is that in which only variable 4 is dropped. This 
yields a residual sum of squares of 0. 115 and a distance d of 1. 04. 

Considering the fact that the subset regression with variables 1, 4, 9, and 10 
dropped involves four constraints as opposed to the one constraint on b(p = 9) or the 
difficult- to- characterize constraint of b(k = 0. 25), the performance of the usual subset 
regression procedures seems quite good. 


CONCLUSIONS 

We have considered the five major types of biased estimators that have been pro- 
posed in the literature. These are ridge, Marquardt’s generalized inverse, shrunken, 
principal components, and subset regression. 

We present the biased and unbiased estimators of the parameters in a linear model. 
The presentation centers on a duality of the X T X matrix of the least-squares normal 
equations. The duality is in the sense that X T X in its eigenspace representation de- 
scribes how and how well the data space is covered, while the similar representation of 
(X^X) describes how the distribution of the estimated parameters is spread out in 
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the parameter space. 

We consider biased estimators with respect to all three major objectives of a linear 
model analysis; that is, point estimation of the parameters, estimation of the predictive 
regression function, and hypothesis testing of the parameters. Our major conclusions 
with respect to these objectives are as follows: 


Estimation of Parameters 

1. In a nearly singular system, the full parameter vector is essentially inestimable. 
However, certain linear combinations of the parameters are estimable. 

2. Biased estimators all place some kind of constraints on the parameter space in 
order to achieve "better" estimators. 

3 . The decision to use mean squared error as a criterion of goodness should be 

T 

made independently of the condition of X X. 

4. If mean squared error is to be accepted as a criterion of goodness, then only one 
estimator so far proposed (i.e. , Sclove's) has any proven optimality properties. 

5. Because of the lack of distributional information, no biased estimator provides 
interval estimation capability. 


Estimation of Regression Function 

All the biased estimators discussed offer the possibility of decreased mean squared 
error of the predicted regression function. This possibility cannot be assured (except 
for two special cases of principal components estimators) because it is not known how 
to identify the members of each class of biased estimators that provide smaller mean 
squared error. 


Hypothesis Testing 

Only the ordinary least- squares estimators have enough of the distributional theory 
available to provide subset regression techniques in the original parameterization. 
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The overall conclusion is that ordinary least-squares estimation and subset regres- 
sion methods are still the preferred methods of linear model analysis in the regression 
situation. 

Lewis Research Center, 

National Aeronautics and Space Administration, 

Cleveland, Ohio, January 14, 1975, 

506-21. 
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TABLE I. - WEIGHTS OF RUBBER PLANTS TABLE II. - THREE INDEPENDENT ESTIMABLE FUNCTIONS 


W 1 


w 3 

Function 

Estimator 

1 

-1 

0 

a i - a 2 

?1. - ^2. * 14 

0 

1 

-1 

a 2- a 3 

h. -^ 3 . =54 

1/3 

1/3 

1/3 

y + 1/3 (a j + c <2 + <* 3 ) 

1/3 (y x + y 2 . + ys.) = 72 f 


Normal 

Off- type 

Aberrant 

ni - 101 
via = 105 

yu = 94 

y 2 i = 84 

y 2 2 = 88 

y 3 l = 32 


TABLE IE. - RESULTS OF REGRESSIONS 
FOR EXAMPLE 1 
(a) Ridge regression 


k 

b l 

" b 2 

t>i + b 2 

6l - b 2 

Residual sum 
of squares 

0.00 

5.3569 

-1.7142 

3.6427 

7. 0711 

0.3636 

.02 

3.5709 

.0354 

3.6063 

3. 5355 

. 490 

.04 

2.9638 

.6068 

3.5706 

2.3570 

. 591 

.10 

2.3230 

1. 1445 

3.4675 

1. 1785 

. 741 

.20 

1.9757 

1.3328 

3.3085 

. 6429 

.887 

.40 

1.6836 

1.3469 

3. 0305 

.3367 

1. 188 

1.00 

1.2795 

1. 1408 

2. 4203 

. 1387 

2. 324 


(b) Generalized inverse regression 


P 

b l 

b 2 

fc>i + bg 

bi - b 2 

Residual sum 
of squares 

2.0 

5.3569 

-1.7142 

3.6427 

7. 0711 

0.3636 

1. 6 

3.9427 

-.3000 

3.6427 

4. 2427 

.444 

1.2 

2. 5284 

1. 1142 

3. 6426 

1. 4142 

.684 

1. 0 

1.8213 

1.8213 

3.6426 

0 

. 864 

.5 

.9107 

.9107 

1.8214 

0 

4. 148 


(c) Best subset regression 








Subset 

b l 

b 2 

b l + b 2 

b l - b 2 

Residual sum 
of squares 

X 1 

3. 6770 

0 

3. 6770 

3.6770 

0.4800 

x 2 

0 

3.5355 

3.5355 

-3.5355 

1.5000 
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TABLE IV. - DATA MATRIX FOR EXAMPLE 2 IN CORRELATION FORM 


[All calculations used two-digit correlations] 



TABLE V. - SPECTRAL DECOMPOSITION OF X T X 


Eigen- 


Components of associated eigenvectors, 

S., for 

i = 


values, 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

3.694 

0. 408 

-0.014 

0.387 

0. 064 

-0. 473 

-0.465 

-0.206 

0.250 

-0.036 

-0.365 

1. 533 

.364 

. 050 

. 132 

.034 

-.068 

-.275 

.574 

-.570 

.217 

.254 

1.294 

-.108 

. 806 

. 093 

-.361 

. 027 

.087 

. 191 

.063 

.177 

-.345 

1.054 

-.106 

-.026 

-.209 

.302 

-.205 

.059 

-.216 

-.026 

.868 

-.079 

.971 

-.027 

.242 

. 019 

.854 

. 155 

. 063 

.292 

. 141 

-.197 

-.199 

.668 

-.349 

-.223 

. 612 

-.057 

.022 

.051 

.372 

,440 

.246 

.231 

.358 

. 178 

.291 

-.414 

.001 

- . 228 

-.191 

. 100 

.518 

-.017 

.588 

.220 

-.078 

-.368 

-.482 

-. 187 

-.013 

- . 208 

. 517 

.216 

. 031 

-.482 

. 137 

. 591 

-.063 

. 047 

-.067 

.701 

. 038 

-.078 

. 272 

.252 

-.050 

.070 

. 410 

-. 107 

.009 

-.044 

-.400 

.0781 

. 193 

. 074 

-.039 

-.065 





TABLE VI. - RESULTS OF RIDGE REGRESSIONS FOR EXAMPLE 2 


k 



Estimable functions, 

S?b(k), for 

i = 



Residual sum 
of squares 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

0 

-0.375 

-0.332 

-0.035' 

" 0.206 

-0.065 

0. 245 

0. 287 

0.311 

-0.300 

0.877 

0. 105 

. 02 

-.373 

-.328 

-.034 

. 202 

-.064 

.238 

.271 

.285 

-.262 

.683 

. 108 

.04 

-.371 

-.324 

-.034 

. 198 

-.062 

. 231 

.258 

.263 

-.232 

.559 

. 114 

.06 

-.369 

-.320 

-.033 

. 195 

-.061 

.224 

. 245 

.244 

-.209 

. 473 

. 120 

.08 

-.367 

-.316 

-.033 

. 191 

-.060 

.218 

. 234 

.228 

-.190 

. 410 

. 126 

. 10 

-.365 

-.312 

-.032 

. 188 

-.059 

.213 

. 224 

.213 

-. 174 

.362 

. 131 

.15 

-.360 

-.302 

-.031 

. 180 

-. 056 

. 200 

.202 

.185 

-.143 

.280 

. 144 

.20 

-.356 

-.294 

-.030 

. 173 

-.054 

. 188 

. 184 

. 163 

-.122 

.228 

.154 

.25* 

-.351 

-.286 

-.029 

. 166 

-.052 

. 178 

. 169 

. 145 

-. 106 

. 192 

. 164 

.30 

-.347 

-.278 

-.028 

. 160 

-.050 

. 169 

. 156 

. 131 

-.094 

. 166 

. 173 

.50 

-.330 

-.250 

-.025 

. 139 

-.043 

. 140 

. 120 

.095 

-.065 

. 108 

.204 

1.00 

-.295 

-.201 

-.020 

. 106 

-.032 

.098 

.076 

. 056 

-.036 

. 056 

.268 

b(0. 25)* 

-0.288 

-0. 109 

-0.246 

-0.054 

-0.045 

0.339 

0.055 

0.244 

0. Ill 

0. 126 

0. 164 j 


TABLE VII. - RESULTS OF GENERALIZED INVERSE REGRESSIONS FOR EXAMPLE 2 


P 



Regression coefficients, b^p), for 

i = 



Residual sum 
of squares 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

10 

-0. 166 

-0.223 

-0.361 

-0. 106 

-0.479 

0.837 

0. 289 

0.385 

0.082 

0. 094 

0. 105 

9 

-.526 

-.128 

-.369 

-.068 

-.127 

. 152 

. 120 

.320 

. 116 

.151 

. 159 

8 

-.349 

-.147 

-.355 

-.088 

. 083 

. 164 

. 096 

.402 

. 192 

. 136 

. 171 

7 

-.325 

-. 033 

-.205 

-.030 

.087 

.228 

-.064 

.335 

. 182 

.286 

. 192 

6 

-.376 

-.116 

-.086 

-.030 

. 152 

.283 

-.093 

. 186 

. 187 

. 118 

.222 

Sfb(p) 

-0.375 

-0.332 

-0.035 

0.206 

-0. 065 

0.245 

0.287 

0.311 

-0. 300 

0.877 






TABLE VIII. - SUBSET REGRESSION 
WITH VARIABLES 1, 4, 9, 

AND 10 DELETED 


[Residual sum of squares, 0. 137.] 


i 

Parameter 

Estimable functions 

estimates, 

b i 

sjkd) 

Sjb 

1 


-0. 382 

-0.375 

2 

-0.249 

.725 

-.332 

3 

-.367 

.267 

-.035 

4 


-.472 

.206 

5 

-.546 

-.135 

-.065 

6 

1.08 

-.084 

.245 

7 

.336 

.254 

.287 

8 

.369 

.771 

.311 

9 


.260 

-.300 

10 


.441 

.877 


TABLE IX. - OPTIMAL ESTIMATORS AND CRITERIA OF COMPARISON 


Type 

Residual sum 
of squares 

Distance, 

d 

Constraints 

Subset (drop variables 
1, 4, 9, and 10) 

0. 137 

1. 63 

b l = b 4 = b 9 = b 10 = 0 

Generalized inverse, 
b{p - 9) 

. 159 

.48 

Minimum-norm estimator 
such that b = 0 

Ridge, b(k = 0. 25) 

. 164 

. 49 


Subset (drop variable 4) 

. 115 

1.04 

o 

is 

T 
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